Abstract

Aiming to unify known results about clustering mixtures of distributions under separation
conditions, Kumar and Kannan [KK10] introduced a deterministic condition for clustering
datasets. They showed that this single deterministic condition encompasses many previously
studied clustering assumptions. More specifically, their proximity condition requires that in the
target $k$-clustering, the projection of a point $x$ onto the line joining its cluster center $\mu$ and
some other center $\mu'$ is a large additive factor closer to $\mu$ than to $\mu'$. This additive factor can
be roughly described as $k$ times the spectral norm of the matrix representing the differences
between the given (known) dataset and the means of the (unknown) target clustering. Clearly,
the proximity condition implies center separation - the distance between any two centers must
be as large as the above mentioned bound.

In this paper we improve upon the work of Kumar and Kannan [KK10] along several axes.
First, we weaken the center separation bound by a factor of $\sqrt{k}$, and secondly we weaken
the proximity condition by a factor of $k$ (in other words, the revised separation condition is
independent of $k$). Using these weaker bounds we still achieve the same guarantees when all
points satisfy the proximity condition. Under the same weaker bounds, we achieve even better
guarantees when only a $(1-\epsilon)$-fraction of the points satisfy the condition. Specifically, we correctly
cluster all but an $(\epsilon + O(1/c^4))$-fraction of the points, compared to the $O(k^2\epsilon)$-fraction of [KK10],
which is meaningful even in the particular setting when $\epsilon$ is a constant and $k = \omega(1)$. Most
importantly, we greatly simplify the analysis of Kumar and Kannan. In fact, in the bulk of our
analysis we ignore the proximity condition and use only center separation, along with the simple
triangle and Markov inequalities. Yet these basic tools suffice to produce a clustering which (i)
is correct on all but a constant fraction of the points, (ii) has $k$-means cost comparable to the
$k$-means cost of the target clustering, and (iii) has centers very close to the target centers.

Our improved separation condition allows us to match the results of the Planted Partition
Model of McSherry [McS01], improve upon the results of Ostrovsky et al. [ORSS06], and improve
separation results for mixture of Gaussian models in a particular setting.
1 Introduction



In the long-studied field of clustering, there has been substantial work [Das99, DS07, SK01, VW02,
AM05, CR08b, KSV08, DHKM07, BV08] studying the problem of clustering data from mixtures of
distributions under the assumption that the means of the distributions are sufficiently far apart.
Each of these works focuses on one particular type (or family) of distribution, and devises an
algorithm that successfully clusters datasets that come from that particular type. Typically, they
show that w.h.p. such datasets have certain nice properties, then use these properties in the
construction of the clustering algorithm.

The recent work of Kumar and Kannan [KK10] takes the opposite approach. First, they define
a separation condition, deterministic and thus not tied to any distribution, and show that any
set of data points satisfying this condition can be successfully clustered. Having established that,
they show that many previously studied clustering problems indeed satisfy (w.h.p.) this separation
condition. These clustering problems include Gaussian mixture models, the Planted Partition
model of McSherry [McS01] and the work of Ostrovsky et al. [ORSS06]. In this aspect they aim
to unify the existing body of work on clustering under separation assumptions, proving that one
algorithm applies in multiple scenarios.

However, the attempt to unify multiple clustering works is only successful in part. First,
Kumar and Kannan's analysis is "wasteful" w.r.t. the number of clusters $k$. Clearly, motivated by
an underlying assumption that $k$ is constant, their separation bound has linear dependence on $k$
and their classification guarantee has quadratic dependence on $k$. As a result, Kumar and Kannan
overshoot best known bounds for the Planted Partition Model and for mixture of Gaussians by a
factor of $\sqrt{k}$. Similarly, the application to datasets considered by Ostrovsky et al. only holds for
constant $k$. Secondly, the analysis in Kumar-Kannan is far from simple - it relies on most points
being "good", and requires multiple iterations of Lloyd steps before converging to good centers.
Our work addresses these issues.

To formally define the separation condition of [KK10], we require some notation. Our input
consists of $n$ points in $\mathbb{R}^d$. We view our dataset as an $n \times d$ matrix, $A$, where each datapoint
corresponds to a row $A_i$ in this matrix. We assume the existence of a target partition, $T_1, T_2, \ldots, T_k$,
where each cluster's center is $\mu_r = \frac{1}{n_r}\sum_{i \in T_r} A_i$, where $n_r = |T_r|$. Thus, the target clustering is
represented by an $n \times d$ matrix of cluster centers, $C$, where $C_i = \mu_r$ iff $i \in T_r$. Therefore, the $k$-means
cost of this partition is the squared Frobenius norm $\|A - C\|_F^2$, but the focus of this paper is on
the spectral ($L_2$) norm of the matrix $A - C$. Indeed, the deterministic equivalent of the maximal
variance in any direction is, by definition, $\frac{1}{n}\|A - C\|^2 = \max_{\{v : \|v\| = 1\}} \frac{1}{n}\|(A - C)v\|^2$.
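To make the two norms concrete, the following is a minimal sketch (not part of the paper's algorithm) that builds $A - C$ for a toy dataset and computes the $k$-means cost $\|A - C\|_F^2$ and the maximal directional variance $\frac{1}{n}\|A - C\|^2$. The helper names `centered_matrix`, `frobenius_sq` and `spectral_norm` are hypothetical, and the spectral norm is only estimated by power iteration, which is an assumption of this sketch rather than anything the analysis requires.

```python
import math

def centered_matrix(A, labels, k):
    """Return A - C, where row i of C is the mean of the cluster containing point i.
    Assumes every cluster index in range(k) has at least one point."""
    d = len(A[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for row, r in zip(A, labels):
        counts[r] += 1
        for j in range(d):
            sums[r][j] += row[j]
    means = [[s / counts[r] for s in sums[r]] for r in range(k)]
    return [[row[j] - means[r][j] for j in range(d)] for row, r in zip(A, labels)]

def frobenius_sq(M):
    """Squared Frobenius norm: the sum of all squared entries."""
    return sum(x * x for row in M for x in row)

def spectral_norm(M, iters=500):
    """Estimate max_{||v||=1} ||Mv|| by power iteration on M^T M.
    Caveat: a start vector orthogonal to the top singular vector would not converge."""
    d = len(M[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        Mv = [sum(row[j] * v[j] for j in range(d)) for row in M]
        w = [sum(M[i][j] * Mv[i] for i in range(len(M))) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    Mv = [sum(row[j] * v[j] for j in range(d)) for row in M]
    return math.sqrt(sum(x * x for x in Mv))

# Toy data: two clusters with different spreads.
A = [[1.0, 0.0], [-1.0, 0.0], [10.0, 2.0], [10.0, -2.0]]
labels = [0, 0, 1, 1]
D = centered_matrix(A, labels, k=2)
kmeans_cost = frobenius_sq(D)                  # ||A - C||_F^2
max_dir_var = spectral_norm(D) ** 2 / len(A)   # (1/n) ||A - C||^2
```

In this toy instance the cost sums the spread in both coordinates, while the spectral quantity only sees the single worst direction, which is why the spectral norm can be much smaller than the Frobenius norm.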

Definition. Fix $i \in T_r$. We say a datapoint $A_i$ satisfies the Kumar-Kannan proximity condition if
for any $s \neq r$, when projecting $A_i$ onto the line connecting $\mu_r$ and $\mu_s$, the projection of $A_i$ is closer
to $\mu_r$ than to $\mu_s$ by an additive factor of $\Omega\left(k\left(\frac{1}{\sqrt{n_r}} + \frac{1}{\sqrt{n_s}}\right)\|A - C\|\right)$.

Kumar and Kannan proved that if all but at most an $\epsilon$-fraction of the data points satisfy the
proximity condition, they can find a clustering which is correct on all but an $O(k^2\epsilon)$-fraction of
the points. In particular, when $\epsilon = 0$, their algorithm clusters all points correctly. Observe, the
Kumar-Kannan proximity condition gives that the distance $\|\mu_r - \mu_s\|$ is also bigger than the above
mentioned bound. The opposite also holds - one can show that if $\|\mu_r - \mu_s\|$ is greater than this
bound then only few of the points do not satisfy the proximity condition.

1 We comment that, implicitly, Achlioptas and McSherry [AM05] follow a similar approach, yet they focus only on
mixtures of Gaussians and log-concave distributions. Another deterministic condition for clustering was considered
by [CO10], which generalized the Planted Partition Model of [McS01].

1.1 Our Contribution 

Our Separation Condition. In this work, the bulk of our analysis is based on the following
quantitatively weaker version of the proximity condition, which we call center separation. Formally,
we define $\Delta_r = \frac{1}{\sqrt{n_r}}\min\{\sqrt{k}\|A - C\|, \|A - C\|_F\}$ and we assume throughout the paper that for a
large constant $c$ we have that the means of any two clusters $T_r$ and $T_s$ satisfy

$$\|\mu_r - \mu_s\| \ge c(\Delta_r + \Delta_s) \qquad (1)$$

Observe that this is a simpler version of the Kumar-Kannan proximity condition, scaled down by a
factor of $\sqrt{k}$. Even though we show that (1) gives that only a few points do not satisfy the proximity
condition, our analysis (for the most part) does not partition the dataset into good and bad points,
based on satisfying or not satisfying the proximity condition. Instead, our analysis relies on basic
tools, such as the Markov inequality and the triangle inequality. In that sense one can view our
work as "aligning" Kumar and Kannan's work with the rest of the clustering-under-center-separation
literature - we show that the bulk of Kumar and Kannan's analysis can be simplified to rely merely
on center-separation.
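As an illustration of the center separation condition, here is a small sketch that evaluates $\Delta_r$ and tests the pairwise inequality for given centers. It assumes the $1/\sqrt{n_r}$ normalization of $\Delta_r$ that is implied by Fact 1.1 (where the bound is written as $8 n_r \Delta_r^2$), and it takes the two norms of $A - C$ as precomputed inputs. The function names and the numbers in the example are hypothetical.

```python
import math

def delta(n_r, k, spec_norm, frob_norm):
    """Delta_r = min{ sqrt(k) * ||A - C||, ||A - C||_F } / sqrt(n_r).
    The 1/sqrt(n_r) factor is an assumption consistent with Fact 1.1."""
    return min(math.sqrt(k) * spec_norm, frob_norm) / math.sqrt(n_r)

def center_separation_holds(centers, sizes, spec_norm, frob_norm, c):
    """Check ||mu_r - mu_s|| >= c * (Delta_r + Delta_s) for every pair r != s."""
    k = len(centers)
    deltas = [delta(n, k, spec_norm, frob_norm) for n in sizes]
    for r in range(k):
        for s in range(r + 1, k):
            gap = math.dist(centers[r], centers[s])
            if gap < c * (deltas[r] + deltas[s]):
                return False
    return True

# Made-up numbers: two clusters of sizes 100 and 400 whose centers are 1000 apart.
centers = [[0.0, 0.0], [1000.0, 0.0]]
sizes = [100, 400]
spec_norm, frob_norm = 10.0, 12.0   # stand-ins for ||A - C|| and ||A - C||_F

ok_for_c_100 = center_separation_holds(centers, sizes, spec_norm, frob_norm, c=100)
ok_for_c_1000 = center_separation_holds(centers, sizes, spec_norm, frob_norm, c=1000)
```

Note that larger clusters get a smaller $\Delta_r$, so the required gap between two centers is dominated by the smaller of the two clusters.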

Our results. We improve upon the results of [KK10] along several axes. In addition to the weaker
condition of Equation (1), we also weaken the Kumar-Kannan proximity condition by a factor of
$k$, and still retrieve the target clustering, if all points satisfy the ($k$-weaker) proximity condition.
Secondly, if at most $\epsilon n$ points do not satisfy the $k$-weaker proximity condition, we show that we can
correctly classify all but an $(\epsilon + O(1/c^4))$-fraction of the points, improving over the bound of [KK10]
of $O(k^2\epsilon)$. Note that our bound is meaningful even if $\epsilon$ is a constant whereas $k = \omega(1)$. Furthermore,
we prove that the $k$-means cost of the clustering we output is a $(1 + O(1/c))$-approximation of the
$k$-means cost of the target clustering.

Once we have improved on the main theorem of Kumar and Kannan, we derive immediate
improvements on its applications. In Section 3.1 we show our analysis subsumes the work of Ostrovsky
et al. [ORSS06], and applies also to non-constant $k$. Using the fact that Equation (1) "shaves off" a
$\sqrt{k}$ factor from the separation condition of Kumar and Kannan, we obtain a separation condition
of $\Omega(\sigma_{\max}\sqrt{k})$ for learning a mixture of Gaussians, and we also match the separation results of the
Planted Partition model of McSherry [McS01]. These results are described in Section 5.

From an approximation-algorithms perspective, it is clear why the case of $k = \omega(1)$ is of
interest, considering the ubiquity of $k$-partition problems in TCS (e.g., $k$-Median, Max $k$-coverage,
Knapsack for $k$ items, maximizing social welfare in $k$-items auction - all trivially simple for constant
$k$). In addition, we comment that in our setting only the case where $k = \omega(1)$ is of interest, since
otherwise one can approximate the $k$-means cost using the PTAS of Kumar et al. [KSS04], which
doesn't even require any separation assumptions. From a practical point of view, there is a variety
of applications where $k$ is quite large. This includes problems such as clustering images by who is
in them, clustering protein sequences by families of organisms, and problems such as deduplication
where multiple databases are combined and entries corresponding to the same true entity are to be
clustered together [CR02, MBHC95]. The challenges that arise from treating $k$ as non-constant
are detailed in the proofs overview (Section 1.4).

2 We comment that throughout the paper, and much like Kumar and Kannan, we think of $c$ as a large constant
($c = 100$ will do). However, our results also hold when $c = \omega(1)$, allowing for a $(1 + o(1))$-approximation. We also
comment that we think of $d \gg k$, so one should expect $\|A - C\|_F^2 \ge k\|A - C\|^2$ to hold, thus the reader should think
of $\Delta_r$ as dependent on $\sqrt{k}\|A - C\|$. Still, including the degenerate case, where $\|A - C\|_F^2 < k\|A - C\|^2$, simplifies our
analysis in Section 3. One final comment is that (much like all the work in this field) we assume $k$ is given, as part
of the input, and not unknown.

To formally detail our results, we first define some notations and discuss a few preliminary facts. 

1.2 Notations and Preliminaries 

The Frobenius norm of an $n \times m$ matrix $M$, denoted $\|M\|_F$, is defined as $\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}$.
The spectral norm of $M$ is defined as $\|M\| = \max_{x : \|x\| = 1} \|Mx\|$. It is a well known fact that if
the rank of $M$ is $t$, then $\|M\|_F^2 \le t\|M\|^2$. The Singular Value Decomposition (SVD) of $M$ is a
decomposition of $M$ as $M = U\Sigma V^T$, where $U$ is an $n \times n$ unitary matrix, $V$ is an $m \times m$ unitary
matrix, $\Sigma$ is an $n \times m$ diagonal matrix whose entries are nonnegative real numbers, and its diagonal
entries satisfy $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_{\min\{m,n\}}$. The diagonal entries in $\Sigma$ are called the singular values
of $M$, and the columns of $U$ and $V$, denoted $u_i$ and $v_i$ resp., are called the left- and right-singular
vectors. As a convention, when referring to singular vectors, we mean the right-singular vectors.
Observe that the Singular Value Decomposition allows us to write $M = \sum_{i=1}^{\mathrm{rank}(M)} \sigma_i u_i v_i^T$. Projecting
$M$ onto its top $t$ singular vectors means taking $\hat{M} = \sum_{i=1}^{t} \sigma_i u_i v_i^T$. It is a known fact that for any
$t$, the $t$-dimensional subspace which best fits the rows of $M$ is obtained by projecting $M$ onto
the subspace spanned by the top $t$ singular vectors (corresponding to the top $t$ singular values).
Another way to phrase this result is by saying that $\hat{M} = \arg\min_{N : \mathrm{rank}(N) = t} \{\|M - N\|_F\}$. For
a proof, see [KV09]. The same matrix, $\hat{M}$, also minimizes the spectral norm of this difference,
meaning $\hat{M} = \arg\min_{N : \mathrm{rank}(N) = t} \{\|M - N\|\}$ (see [GVL96] for proof).

As previously defined, $\|A - C\|$ denotes the spectral norm of $A - C$. The target clustering, $T$,
is composed of $k$ clusters $T_1, T_2, \ldots, T_k$. Observe that we use $\mu$ as an operator, where for every set
$X$, we have $\mu(X) = \frac{1}{|X|}\sum_{i \in X} A_i$. We abbreviate, and denote $\mu_r = \mu(T_r)$. From this point on, we
denote the projection of $A$ onto the subspace spanned by its top $k$ singular vectors as $\hat{A}$, and for
any vector $v$, we denote $\hat{v}$ as the projection of $v$ onto this subspace. Throughout the paper, we
abuse notation and use $i$ to iterate over the rows of $A$, whereas $r$ and $s$ are used to iterate over
clusters (or submatrices). So $A_i$ represents the $i$th row of $A$ whereas $A_r$ represents the submatrix
$[A_i]_{\{i \in T_r\}}$.

Basic Facts. The analysis of our main theorem makes use of the following facts, from [McS01,
KV09, KK10]. We advise the reader to go over the proofs, which are short, elegant, and provided
in Appendix A. The first fact bounds the cost of assigning the points of $\hat{A}$ to their original centers.

Fact 1.1 (Lemma 9 from [McS01]). $\|\hat{A} - C\|_F^2 \le 8\min\{k\|A - C\|^2, \|A - C\|_F^2\}$ ($= 8 n_r \Delta_r^2$ for every $r$).

Next, we show that we can match each target center $\mu_r$ to a unique, relatively close, center $\nu_r$
that we get in Part I of the algorithm.

Fact 1.2 (Claim 1 in Section 3.2 of [KV09]). For every $\mu_r$ there exists a center $\nu_s$ s.t. $\|\mu_r - \nu_s\| \le
6\Delta_r$, so we can match each $\mu_r$ to a unique $\nu_r$.
Finally, we exhibit the following fact, which is detailed in the analysis of [KK10].

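Fact 1.3. Fix a target cluster $T_r$ and let $S_r$ be a set of points created by removing $\rho_{out} n_r$ points
from $T_r$ and adding $\rho_{in}(s) n_r$ points from each cluster $s \neq r$, s.t. every added point $x$ satisfies
$\|x - \mu_s\| \ge \frac{2}{3}\|x - \mu_r\|$. Assume $\rho_{out} < \frac{1}{4}$ and $\rho_{in} := \sum_{s \neq r} \rho_{in}(s) < \frac{1}{4}$. Then

$$\|\mu(S_r) - \mu_r\| \le \frac{1}{\sqrt{n_r}}\left(\sqrt{\rho_{out}} + \frac{3}{2}\sum_{s \neq r}\sqrt{\rho_{in}(s)}\right)\|A - C\| \le \left(\sqrt{\frac{\rho_{out}}{n_r}} + \frac{3}{2}\sqrt{k}\sqrt{\frac{\rho_{in}}{n_r}}\right)\|A - C\|$$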
1.3 Formal Description of the Algorithm and Our Theorems

Having established notation, we now present our algorithm, in Figure 1. Our algorithm's goal is
threefold: (a) to find a partition that agrees with the target clustering on the majority of the
points, (b) to have the $k$-means cost of this partition comparable with the target, and (c) to output
$k$ centers which are close to the true centers. It is partitioned into 3 parts. Each part requires
stronger assumptions, allowing us to prove stronger guarantees.



Part I: Find initial centers:

• Project $A$ onto the subspace spanned by the top $k$ singular vectors.

• Run a 10-approximation algorithm^a for the $k$-means problem on the projected
matrix $\hat{A}$, and obtain $k$ centers $\nu_1, \nu_2, \ldots, \nu_k$.

Part II: Set $S_r \leftarrow \{i : \|\hat{A}_i - \nu_r\| \le \frac{1}{3}\|\hat{A}_i - \nu_s\| \text{ for every } s \neq r\}$ and $\theta_r \leftarrow \mu(S_r)$.

Part III: Repeatedly run Lloyd steps until convergence.

• Set $\Theta_r \leftarrow \{i : \|A_i - \theta_r\| \le \|A_i - \theta_s\| \text{ for every } s\}$.

• Set $\theta_r \leftarrow \mu(\Theta_r)$.

^a Throughout the paper, we assume the use of a 10-approximation algorithm. Clearly, it is possible to
use any $t$-approximation algorithm, assuming $c/t$ is a large enough constant.

Figure 1: Algorithm ~Cluster
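Parts II and III of Figure 1 can be sketched in a few lines. This is only an illustration under stated assumptions: Part I (the SVD projection and the 10-approximation) is not implemented, so the projected points `A_hat` and the initial centers `nu` are taken as inputs. The function names, the fallback for an empty $S_r$, and the iteration cap are choices of this sketch, not of the paper.

```python
import math

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(d)]

def part_two(A, A_hat, nu):
    """Part II: theta_r is the mean of the ORIGINAL rows A_i whose projections
    A_hat_i are at least 3 times closer to nu_r than to every other nu_s."""
    k = len(nu)
    theta = []
    for r in range(k):
        S_r = [i for i in range(len(A))
               if all(3.0 * math.dist(A_hat[i], nu[r]) <= math.dist(A_hat[i], nu[s])
                      for s in range(k) if s != r)]
        # Assumption (not in the paper): keep nu_r if S_r happens to be empty.
        theta.append(mean([A[i] for i in S_r]) if S_r else list(nu[r]))
    return theta

def lloyd(A, theta, max_iters=100):
    """Part III: alternate nearest-center assignment and mean updates."""
    k = len(theta)
    labels = []
    for _ in range(max_iters):
        labels = [min(range(k), key=lambda r: math.dist(a, theta[r])) for a in A]
        new_theta = []
        for r in range(k):
            members = [a for a, lab in zip(A, labels) if lab == r]
            new_theta.append(mean(members) if members else theta[r])
        if new_theta == theta:  # centers (hence assignments) stopped changing
            break
        theta = new_theta
    return labels, theta

# Toy input: the last point lies between the two groups.
A = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0], [4.0, 1.0]]
A_hat = A                        # toy case: pretend the projection is the identity
nu = [[0.0, 1.0], [10.0, 1.0]]   # stand-in for the centers returned by Part I
theta_two = part_two(A, A_hat, nu)
labels, theta_final = lloyd(A, theta_two)
```

In the toy input the point `[4.0, 1.0]` belongs to no $S_r$, so it does not influence the Part II centers, but the Lloyd steps of Part III still assign it to the nearer center.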

• Assuming only the center separation of (1), Part I gives a clustering which (a) is correct
on at least a $1 - O(c^{-2})$ fraction of the points from each target cluster (Theorem 3.1), and (b)
has $k$-means cost smaller than $(1 + O(1/c))\|A - C\|_F^2$ (Theorem 3.2).

• Assuming also that $\Delta_r = \frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$, i.e. assuming the non-degenerate case where
$\|A - C\|_F^2 \ge k\|A - C\|^2$, Part II finds centers that are $O(1/c)\frac{1}{\sqrt{n_r}}\|A - C\|$ close to the true centers
(Theorem 4.1). As a result (see Section 4.1), if $(1 - \epsilon)n$ points satisfy the proximity condition
(weakened by a $k$ factor), then we misclassify no more than $(\epsilon + O(c^{-4}))n$ points.

• Assuming all points satisfy the proximity condition (weakened by a $k$-factor), Part III finds
exactly the target partition (Theorem Wl 
1.4 Organization and Proofs Overview

Organization. Related work is detailed in Section 2. The analysis of Part I of our algorithm is
in Section 3. Part I is enough for us to give a "one-line" proof in Section 3.1 showing how the work
of Ostrovsky et al. falls into our framework. The analysis of Part II of the algorithm is in Section 4.
The improved guarantees we get by applying the algorithm to the Planted Partition model and
to the Gaussian mixture model are discussed in Section 5. We conclude with an open problem in
Section 6.

Proof outline for Section 3. The first part of our analysis is an immediate application of
Facts 1.1 and 1.2. Our assumption dictates that the distance between any two centers is big
($\ge c(\Delta_r + \Delta_s)$). Part I of the algorithm assigns each projected point $\hat{A}_i$ to the nearest $\nu_r$ instead
of the true center $\mu_r$, and Fact 1.2 assures that the distance $\|\mu_r - \nu_r\|$ is small ($\le 6\Delta_r$). Consider
a misclassified point $A_i$, where $\|A_i - \mu_r\| < \|A_i - \mu_s\|$ yet $\|\hat{A}_i - \nu_s\| < \|\hat{A}_i - \nu_r\|$. The triangle
inequality assures that $\hat{A}_i$ has a fairly big distance to its true center ($\ge (\frac{c}{2} - 12)\Delta_r$). We deduce that
each misclassified point contributes $\Omega(c^2\Delta_r^2)$ to the $k$-means cost of assigning all projected points to
their true centers. Fact 1.1 bounds this cost by $\|\hat{A} - C\|_F^2 \le 8 n_r \Delta_r^2$, so the Markov inequality proves
only a few points are misclassified. Additional application of the triangle inequality for misclassified
points gives that the distance between the original point $A_i$ and a true center $\mu_r$ is comparable to
the distance $\|A_i - \mu_s\|$, and so assigning $A_i$ to the cluster $s$ only increases the $k$-means cost by a
small factor.

Proof outline for Section 4. In the second part of our analysis we compare the true
clustering $T$ with some proposed clustering $S$, looking both at the number of misclassified points
and at the distances between the matching centers $\|\mu_r - \theta_r\|$. As Kumar and Kannan show, the
two measurements are related: Fact 1.3 shows how the distances between the means depend on the
number of misclassified points, and the main lemma (Lemma 4.5) essentially shows the opposite
direction. These two relations are how Kumar and Kannan show that Lloyd steps converge to good
centers, yielding clusters with few misclassified points. They repeatedly apply (their version of) the
main lemma, showing that with each step the distances to the true means decrease and so fewer of
the good points are misclassified.

To improve on Kumar and Kannan's analysis, we improve on the two above-mentioned relations.
Lemma 4.5 is a simplification of a lemma from Kumar and Kannan, where instead of projecting
into a $k$-dimensional space, we project only into a 4-dimensional space, thus reducing dependency
on $k$. However, the dependency of Fact 1.3 on $k$ is tight³. So in Part II of the algorithm we devise
sub-clusters $S_r$ s.t. $\rho_{in}(s) = \rho_{out}/k^2$. The crux in devising $S_r$ lies in Proposition 4.4 - we show
that any misclassified projected point $i \in T_s \cap S_r$ is essentially misclassified by $\hat{\mu}_r$. And since
(see [AM05]) $\|\mu_r - \hat{\mu}_r\| \le \frac{1}{\sqrt{k}}\Delta_r$ (compared to the bound $\|\mu_r - \nu_r\| \le 6\Delta_r$), we are able to give a
good bound on $\rho_{in}(s)$.

Recall that we rely only on center separation rather than a large batch of points satisfying
the Kumar-Kannan separation, and so we do not apply iterative Lloyd steps (unless all points are
good). Instead, we apply the main lemma only once, w.r.t. the misclassified points in $T_s \cap S_r$, and
deduce that the distances $\|\mu_r - \theta_r\|$ are small. In other words, Part II is a single step that retrieves
centers whose distances to the original centers are $\sqrt{k}$-times better than the centers retrieved by
Kumar and Kannan in numerous Lloyd iterations.

3 In fact, Fact 1.3 is exactly why the case of $k = \omega(1)$ is hard - because the $L_1$ and $L_2$ norms of the vector
$(\frac{1}{\sqrt{k}}, \ldots, \frac{1}{\sqrt{k}})$ are not comparable for non-constant $k$.
We would like to thank Avrim Blum for multiple helpful discussions and suggestions. We thank
Amit Kumar for clarifying a certain point in the original Kumar and Kannan paper. We thank the 
anonymous referees for their suggestions, and especially regarding a discussion about the result of 
2 Related Work

The work of [Das99] was the first to give theoretical guarantees for the problem of learning a
mixture of Gaussians under separation conditions. It showed that one can learn a mixture of $k$
spherical Gaussians provided that the separation between the cluster means is $\Omega(\sqrt{n}(\sigma_r + \sigma_s))$ and
the mixing weights are not too small. Here $\sigma_r^2$ denotes the maximum variance of cluster $r$ along
any direction. This separation was improved to $\Omega((\sigma_r + \sigma_s)n^{1/4})$ by [DS07]. Arora and Kannan
[SK01] extended these results to the case of general Gaussians. For the case of spherical Gaussians,
[VW02] showed that one can learn under a much weaker separation of $\Omega((\sigma_r + \sigma_s)k^{1/4})$. This was
extended to arbitrary Gaussians by [AM05] and to various other distributions by [KSV08], although
requiring a larger separation. In particular, the work of [AM05] requires a separation of
$\Omega\left((\sigma_r + \sigma_s)\left(\frac{1}{\sqrt{\min\{w_r, w_s\}}} + \sqrt{k\log(k\min\{2^k, n\})}\right)\right)$ whereas [KSV08] require a separation of $\Omega\left(\frac{k^{3/2}}{w_{\min}^2}(\sigma_r + \sigma_s)\right)$.
Here the $w_r$'s refer to the mixing weights. [CR08a, CR08b] gave algorithms for clustering mixtures of
product distributions and mixtures of heavy tailed distributions. [BV08] gave an algorithm for
clustering the mixture of 2 Gaussians assuming only that the two Gaussians are separated by a
hyperplane. They also give results for learning a mixture of $k > 2$ Gaussians. The work of [KMV10]
gave an algorithm for learning a mixture of 2 Gaussians, with provably minimal assumptions. This
was extended in [MV10] to the case when $k > 2$ although the algorithm runs in time exponential in
$k$. Similar results were obtained in the work of [BS10] who can also learn more general distribution
families. The work of [CO10] studied a deterministic separation condition required for efficient
clustering. The precise condition presented in [CO10] is technical but essentially assumes that the
underlying graph over the set of points has a "low rank structure" and presents an algorithm to
recover this structure which is then enough to cluster well. In addition, previous works (e.g. [Sch00,
BBG09]) addressed the problem of clustering from the viewpoint of minimizing the number of
mislabeled points.

There has been an extensive line of work on approximation algorithms for the $k$-means problem
([OR00, BHPI02, dlVKKR03, ES04, HPM04, KMN+02]). The current best guarantee is a
$(9 + \epsilon)$-approximation algorithm of [KMN+02] (with a much simpler analysis in [GT08]) if
polynomial dependence on $k$ and the dimension $d$ is desired⁴. Another popular algorithm for $k$-means
is Lloyd's heuristic ([Llo82]). This heuristic, combined with a careful seeding of centers, has
been shown to have good performance if the data is well separated (see [ORSS06]), or to
provide an $O(\log k)$-approximation in general [AV07]. The separation-based results of [ORSS06] were
improved by [ABS10].

4 For constant $k$, [KSS04] give a PTAS for the $k$-means problem.
3 Part I of the Algorithm

In this section, we look only at Part I of our algorithm. Our approximation algorithm defines a
clustering $Z$, where $Z_r = \{i : \|\hat{A}_i - \nu_r\| \le \|\hat{A}_i - \nu_s\| \text{ for every } s\}$. Our goal in this section is
to show that $Z$ is correct on all but a small constant fraction of the points, and furthermore, the
$k$-means cost of $Z$ is no more than $(1 + O(1/c))$ times the $k$-means cost of the target clustering.

Theorem 3.1. There exists a matching (given by Fact 1.2) between the target clustering $T$ and
the clustering $Z = \{Z_r\}_r$ where $Z_r = \{i : \|\hat{A}_i - \nu_r\| \le \|\hat{A}_i - \nu_s\| \text{ for every } s\}$ that satisfies the
following properties:

• For every cluster $T_{s_0}$ in the target clustering, no more than $O(1/c^2)|T_{s_0}|$ points are
misclassified.

• For every cluster $Z_{r_0}$ in the clustering that the algorithm outputs, we add no more than
$O(1/c^2)|T_{r_0}|$ points from other clusters.

• At most $O(1/c^2)|T_{r_2}|$ points are misclassified overall, where $T_{r_2}$ is the second largest cluster.

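Proof. Let us denote $T_{s \to r}$ as the set of points $\hat{A}_i$ that are assigned to $T_s$ in the target clustering,
yet are closer to $\nu_r$ than to any other $\nu_{r'}$. From the triangle inequality we have that $\|\hat{A}_i - \mu_s\| \ge
\|\hat{A}_i - \nu_s\| - \|\mu_s - \nu_s\|$. We know from Fact 1.2 that $\|\mu_s - \nu_s\| \le 6\Delta_s$. Also, since $\hat{A}_i$ is closer to $\nu_r$
than to $\nu_s$, the triangle inequality gives that $2\|\hat{A}_i - \nu_s\| \ge \|\nu_r - \nu_s\|$. So,

$$\|\hat{A}_i - \mu_s\| \ge \frac{1}{2}\|\nu_r - \nu_s\| - 6\Delta_s \ge \frac{1}{2}\|\mu_r - \mu_s\| - 12(\Delta_r + \Delta_s) \ge \frac{c}{4}(\Delta_r + \Delta_s)$$

Thus, we can look at $\|\hat{A} - C\|_F^2$, and using Fact 1.1 we immediately have that for every fixed $r'$

$$\sum_r \sum_{s \neq r} |T_{s \to r}| \frac{c^2}{16}(\Delta_r + \Delta_s)^2 \le \sum_r \sum_{i \in T_r} \|\hat{A}_i - \mu_r\|^2 = \|\hat{A} - C\|_F^2 \le 8 n_{r'} \Delta_{r'}^2$$

The proof of the theorem follows from fixing some $r_0$ or some $s_0$ and deducing:

$$\Delta_{s_0}^2 \sum_{r \neq s_0} |T_{s_0 \to r}| \le \sum_{r \neq s_0} |T_{s_0 \to r}|(\Delta_r + \Delta_{s_0})^2 \le \sum_r \sum_{s \neq r} |T_{s \to r}|(\Delta_r + \Delta_s)^2 \le \frac{128}{c^2} n_{s_0} \Delta_{s_0}^2$$

$$\Delta_{r_0}^2 \sum_{s \neq r_0} |T_{s \to r_0}| \le \sum_{s \neq r_0} |T_{s \to r_0}|(\Delta_{r_0} + \Delta_s)^2 \le \sum_r \sum_{s \neq r} |T_{s \to r}|(\Delta_r + \Delta_s)^2 \le \frac{128}{c^2} n_{r_0} \Delta_{r_0}^2$$

Observe that for every $r \neq s$ we have that $\Delta_r + \Delta_s \ge \Delta_{r_2}$ (where $r_2$ is the cluster with the second
largest number of points), so we have that

$$\Delta_{r_2}^2 \sum_r \sum_{s \neq r} |T_{s \to r}| \le \sum_r \sum_{s \neq r} |T_{s \to r}|(\Delta_r + \Delta_s)^2 \le \frac{128}{c^2} n_{r_2} \Delta_{r_2}^2 \qquad \Box$$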

We now show that the $k$-means cost of $Z$ is close to the $k$-means cost of $T$. Observe that the
$k$-means cost of $Z$ is computed w.r.t. the best center of each cluster (i.e., $\mu(Z_r)$), and not w.r.t. the
centers $\nu_r$.

Theorem 3.2. The $k$-means cost of $Z$ is at most $(1 + O(1/c))\|A - C\|_F^2$.

Proof. Given $Z$, it is clear that the centers that minimize its $k$-means cost are $\mu(Z_r) = \frac{1}{|Z_r|}\sum_{i \in Z_r} A_i$.
Recall that the majority of points in each $Z_r$ belong to a unique $T_r$, and so, throughout this
section, we assume that all points in $Z_r$ were assigned to $\mu_r$, and not to $\mu(Z_r)$. (Clearly, this can
only increase the cost.) We show that by assigning the points of $Z_r$ to $\mu_r$, our cost is at most
$(1 + O(1/c))\|A - C\|_F^2$, and so Theorem 3.2 follows. In fact, we show something stronger. We show
that by assigning all the points in $Z_r$ to $\mu_r$, each point $A_i$ pays no more than $(1 + O(1/c))\|A_i - C_i\|^2$.
This is clearly true for all the points in $Z_r \cap T_r$. We show this also holds for the misclassified points.

Because $i \in T_{s \to r}$, it holds that $\|\hat{A}_i - \nu_r\| \le \|\hat{A}_i - \nu_s\|$. Observe that for every $s$ we have that
$\|A_i - \nu_s\|^2 = \|A_i - \hat{A}_i\|^2 + \|\hat{A}_i - \nu_s\|^2$, because $\hat{A}_i - \nu_s$ is the projection of $A_i - \nu_s$ onto the subspace
spanned by the top $k$ singular vectors of $A$. Therefore, it is also true that $\|A_i - \nu_r\| \le \|A_i - \nu_s\|$.
Because of Fact 1.2 we have that $\|\mu_r - \nu_r\| \le 6\Delta_r$ and $\|\mu_s - \nu_s\| \le 6\Delta_s$, so we apply the triangle
inequality and get

$$\|A_i - \mu_r\| \le \|A_i - \mu_s\| + \|\mu_r - \nu_r\| + \|\mu_s - \nu_s\| \le \|A_i - \mu_s\|\left(1 + \frac{6(\Delta_r + \Delta_s)}{\|A_i - \mu_s\|}\right)$$

So all we need to do is to lower bound $\|A_i - \mu_s\|$. As noted, $\|A_i - \nu_s\| \ge \|\hat{A}_i - \nu_s\|$. Thus

$$\|A_i - \mu_s\| \ge \|A_i - \nu_s\| - 6\Delta_s \ge \|\hat{A}_i - \nu_s\| - 6\Delta_s \ge \frac{1}{2}\|\nu_s - \nu_r\| - 6\Delta_s \ge \frac{1}{4}c(\Delta_r + \Delta_s)$$

and we have the bound $\|A_i - \mu_r\| \le (1 + \frac{24}{c})\|A_i - \mu_s\|$, so $\|A_i - \mu_r\|^2 \le (1 + \frac{24}{c})^2\|A_i - \mu_s\|^2 = (1 + O(1/c))\|A_i - \mu_s\|^2$. $\Box$
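The orthogonal decomposition $\|A_i - \nu_s\|^2 = \|A_i - \hat{A}_i\|^2 + \|\hat{A}_i - \nu_s\|^2$ used in the proof above holds whenever $\nu_s$ lies in the subspace onto which $A_i$ is projected. The following sketch checks it numerically on a toy example. The basis, the point and the center are made-up values, and `project` is a hypothetical helper that assumes an orthonormal basis is given.

```python
def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def project(x, basis):
    """Orthogonal projection of x onto span(basis); basis must be orthonormal."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        out = [o + coeff * bj for o, bj in zip(out, b)]
    return out

# A 2-dimensional subspace of R^3 with an orthonormal basis (illustrative values).
basis = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
x = [3.0, 1.0, 2.0]        # plays the role of A_i
nu = [2.0, 3.0, 4.0]       # lies in the subspace: 2*basis[0] + 5*basis[1]
x_hat = project(x, basis)  # plays the role of the projected point

lhs = sq_dist(x, nu)
rhs = sq_dist(x, x_hat) + sq_dist(x_hat, nu)
```

Since the residual `x - x_hat` is orthogonal to the subspace, distances to any center inside the subspace can only shrink under projection, which is the step that transfers the comparison of projected distances to the original points.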
3.1 Application: The ORSS-Separation 

One straightforward application of Theorem 3.2 is for the datasets considered by Ostrovsky et
al. [ORSS06], where the optimal $k$-means cost is an $\epsilon$-fraction of the optimal $(k - 1)$-means cost.
Ostrovsky et al. proved that for such datasets a variant of the Lloyd method converges to a good
solution in polynomial time. Kumar and Kannan have shown that datasets satisfying the
ORSS-separation also have the property that most points satisfy their proximity condition. Their analysis
is not immediate, and gives a (1 + 0(\/^e))-approximation. Here, we provide a "one-line" proof 
that Part I of Algorithm ~Cluster yields a $(1 + O(\sqrt{\epsilon}))$-approximation, for any $k$.

Suppose we have a dataset satisfying the ORSS-separation condition, so any $(k - 1)$-partition
of the dataset has cost $\ge \frac{1}{\epsilon}\|A - C\|_F^2$. For any $r$ and any $s \neq r$, by assigning all the points in $T_r$
to the center $\mu_s$, we get some $(k - 1)$-partition whose cost is exactly $\|A - C\|_F^2 + n_r\|\mu_r - \mu_s\|^2$, so
$\|\mu_r - \mu_s\| \ge \sqrt{\frac{1}{n_r}\left(\frac{1}{\epsilon} - 1\right)}\,\|A - C\|_F$. Setting $c = O(1/\sqrt{\epsilon})$, Theorem 3.2 is immediate.
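The cost identity used in this argument, namely that reassigning all of $T_r$ to the center $\mu_s$ raises the cost by exactly $n_r\|\mu_r - \mu_s\|^2$, can be checked on a small example. The sketch below uses two made-up clusters; the helper names are hypothetical.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(d)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cost_to_center(points, center):
    """Sum of squared distances from the points to a single center."""
    return sum(sq_dist(p, center) for p in points)

# Two illustrative target clusters.
T_r = [[0.0, 0.0], [2.0, 0.0]]
T_s = [[10.0, 1.0], [10.0, -1.0], [13.0, 0.0]]
mu_r, mu_s = mean(T_r), mean(T_s)

# Cost of the target clustering, i.e. ||A - C||_F^2.
k_cost = cost_to_center(T_r, mu_r) + cost_to_center(T_s, mu_s)

# Cost after sending every point of T_r to the center mu_s.
merged_cost = cost_to_center(T_r, mu_s) + cost_to_center(T_s, mu_s)

# The identity from the text: original cost plus n_r * ||mu_r - mu_s||^2.
identity_rhs = k_cost + len(T_r) * sq_dist(mu_r, mu_s)
```

The identity holds because $\mu_r$ is the mean of $T_r$, so the cross term in the expansion of $\|A_i - \mu_s\|^2$ around $\mu_r$ sums to zero over $T_r$.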



4 Part II of the Algorithm 

In this section, our goal is to show that Part II of our algorithm gives centers that are very close to
the target centers. We should note that from this point on, we assume we are in the non-degenerate
case, where $\|A - C\|_F^2 \ge k\|A - C\|^2$. Therefore, $\Delta_r = \frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$.

Recall, in Part II we define the sets $S_r = \{i : \|\hat{A}_i - \nu_r\| \le \frac{1}{3}\|\hat{A}_i - \nu_s\|, \forall s \neq r\}$. Observe, these
sets do not define a partition of the dataset! There are some points that are not assigned to any $S_r$.
However, we only use the centers of $S_r$. We prove the following theorem.
Theorem 4.1. Denote $S_r = \{i : \|\hat{A}_i - \nu_r\| \le \frac{1}{3}\|\hat{A}_i - \nu_s\|, \forall s \neq r\}$. Then for every $r$ it holds that
$\|\mu(S_r) - \mu_r\| = O(1/c)\frac{1}{\sqrt{n_r}}\|A - C\| = O\left(\frac{1}{c\sqrt{k}}\Delta_r\right)$.

The proof of Theorem 4.1 is an immediate application of Fact 1.3 combined with the following
two lemmas, that bound the number of misclassified points. Observe that every point that
belongs to $T_s$ yet is assigned to $S_r$ (for $s \neq r$) is also assigned to $Z_r$ in the clustering $Z$ discussed
in the previous section. Therefore, any misclassified point $i \in T_s \cap S_r$ satisfies that $\|A_i - \mu_r\| \le
(1 + O(c^{-1}))\|A_i - \mu_s\|$ as the proof of Theorem 3.2 shows. So all conditions of Fact 1.3 hold.

Lemma 4.2. Assume that for every $r$ we have that $\|\mu_r - \nu_r\| \le 6\Delta_r$. Then at most $\frac{512}{c^2} n_r$ points
of $T_r$ do not belong to $S_r$.

Lemma 4.3. Redefine T s ^ r as the set T s n S r . Assume that for every r we have that \\fj, r — v r \\ < 
6A r . Then for every r and every s ^ r we have that |r s _s. r | = (^q^j n T . 



Proof of Lemma 4.2. First, we claim that if $i$ is such that $\|\hat{A}_i - \mu_r\| \le \frac{c}{8}\Delta_r$, then it must be
the case that $i \in S_r$.

This is a simple consequence of the triangle inequality, bounding $\|\hat{A}_i - \nu_r\| \le \|\hat{A}_i - \mu_r\| +
\|\mu_r - \nu_r\| \le ((c/8) + 6)\Delta_r$. Yet, for every $s \neq r$, the triangle inequality gives that $\|\hat{A}_i - \nu_s\| \ge
\|\mu_r - \mu_s\| - \|\hat{A}_i - \mu_r\| - \|\mu_s - \nu_s\| \ge (c - \frac{c}{8} - 6)(\Delta_r + \Delta_s)$. Assuming $c > 48$, we have that
$\|\hat{A}_i - \nu_r\| \le \frac{1}{3}\|\hat{A}_i - \nu_s\|$.

All that's left is to show that the number of $i \in T_r$ s.t. $\|\hat{A}_i - \mu_r\| > \frac{c}{8}\Delta_r$ is small. This again
follows from the Markov inequality: Since $\|\hat{A} - C\|_F^2 \le 8k\|A - C\|^2$, the number of such points
is at most $\frac{8k\|A - C\|^2}{(c^2/64)\,k\,\|A - C\|^2}\, n_r = \frac{512}{c^2} n_r$. $\Box$

We now turn to proving Lemma 4.3. The general outline of the proof of Lemma 4.3 resembles
the outline of the proof of Lemma 4.2. Proposition 4.4 exhibits some property that every point in
$T_{s \to r}$ must satisfy, and then we show that only few of the points in $T_s$ satisfy this property. Recall
that $\hat{\mu}_r$ indicates the projection of $\mu_r$ onto the subspace spanned by the top $k$ singular vectors of
$A$.

Proposition 4.4. Fix $i \in T_s$ s.t. $\|\hat{A}_i - \hat{\mu}_s\| \le 2\|\hat{A}_i - \hat{\mu}_r\|$. Then $\|\hat{A}_i - \nu_s\| \le 3\|\hat{A}_i - \nu_r\|$, so $i \notin S_r$.
Proof. First, for every $r$ we have that $\|\hat{\mu}_r - \nu_r\| \le \|\mu_r - \nu_r\| \le 6\Delta_r$, as $\hat{\mu}_r - \nu_r$ is a projection of $\mu_r - \nu_r$.

Let us fiddle with the triangle inequality, in order to obtain a lower bound on $\|\hat{A}_i - \nu_r\|$. We
have that $3\|\hat{A}_i - \hat{\mu}_r\| \ge \|\hat{\mu}_r - \hat{\mu}_s\| \ge \|\mu_r - \mu_s\| - (\|\mu_r - \nu_r\| + \|\nu_r - \hat{\mu}_r\|) - (\|\mu_s - \nu_s\| + \|\nu_s - \hat{\mu}_s\|) \ge (c - 12)(\Delta_r + \Delta_s)$, thus $\|\hat{A}_i - \nu_r\| \ge \left(\frac{c}{3} - 6\right)(\Delta_r + \Delta_s)$.

Assume for the sake of contradiction that $\|\hat{A}_i - \nu_s\| > 3\|\hat{A}_i - \nu_r\|$, and let us show this yields
an upper bound on $\|\hat{A}_i - \nu_r\|$, which contradicts our lower bound. We have that
$$6\Delta_s \ge \|\hat{A}_i - \nu_s\| - \|\hat{A}_i - \hat{\mu}_s\| > 3\|\hat{A}_i - \nu_r\| - 2\|\hat{A}_i - \hat{\mu}_r\| \ge \|\hat{A}_i - \nu_r\| - 2 \cdot 6\Delta_r .$$

It follows that $12(\Delta_r + \Delta_s) > \|\hat{A}_i - \nu_r\| \ge \left(\frac{c}{3} - 6\right)(\Delta_r + \Delta_s)$. Contradiction ($c > 60$). □

Proposition 4.4 shows that in order to bound $|T_{s \to r}|$ it suffices to bound the number of points
in $T_s$ satisfying $\|\hat{A}_i - \hat{\mu}_s\| \ge 2\|\hat{A}_i - \hat{\mu}_r\|$. The major tool in providing this bound is the following
technical lemma. This lemma is a variation on the work of [KK10], in which we improve the
dependency on $k$ and simplify the proof.

Lemma 4.5 (Main Lemma). Fix $\alpha, \beta > 0$. Fix $r \ne s$ and let $\zeta_r$ and $\zeta_s$ be two points s.t.
$\|\mu_r - \zeta_r\| \le \alpha\Delta_r$ and $\|\mu_s - \zeta_s\| \le \alpha\Delta_s$. We denote $\bar{A}_i$ as the projection of $A_i$ onto the line
connecting $\zeta_r$ and $\zeta_s$. Define $X = \{i \in T_s : \|\bar{A}_i - \zeta_s\| - \|\bar{A}_i - \zeta_r\| \ge \beta\|\zeta_s - \zeta_r\|\}$. Then
$$|X| \le 256\,\frac{\alpha^2}{\beta^2}\,\frac{1}{c^4 k}\,\min\{n_r, n_s\} .$$

Proof. Let $V$ be the subspace spanned by the following 4 vectors: $\{\mu_r, \mu_s, \zeta_r, \zeta_s\}$. Denote $P_V$ as the
projection onto $V$. We denote $v_i = P_V(A_i)$, and observe that $P_V(\mu_r) = \mu_r$, and the same goes for
$\mu_s$, $\zeta_r$ and $\zeta_s$. Observe also that, as a projection, $\|P_V(A - C)\| \le \|A - C\|$ (alternatively, $\|P_V\| = 1$).

We now make a simple observation. Let $\tilde{A}_i$ denote the projection of $A_i$ onto the line connecting
$\mu_r$ and $\mu_s$. Now, the inequality $\|A_i - \mu_s\| \le \|A_i - \mu_r\|$ holds iff the inequality $\|\tilde{A}_i - \mu_s\| \le \|\tilde{A}_i - \mu_r\|$
holds (because $\|A_i - \mu_r\|^2 = \|A_i - \tilde{A}_i\|^2 + \|\tilde{A}_i - \mu_r\|^2$). Furthermore, such a relation holds for any
point whose projection on the line connecting $\mu_r$ and $\mu_s$ is identical to $\tilde{A}_i$. In particular, if $W$ is
any subspace containing $\mu_r$ and $\mu_s$, then the projection of $A_i$ onto $W$ is closer to $\mu_r$ than to $\mu_s$
iff $A_i$ is closer to $\mu_r$ than to $\mu_s$. Thus, since $\|A_i - \mu_s\| \le \|A_i - \mu_r\|$ then $\|v_i - \mu_s\| \le \|v_i - \mu_r\|$.
Furthermore, as $\zeta_r$ and $\zeta_s$ also belong to $V$, the projection of $A_i$ onto the line connecting $\zeta_s$
and $\zeta_r$ is identical to the projection of $v_i$ onto the same line (meaning, $\bar{A}_i = \bar{v}_i$). So $v_i$ also satisfies
the inequality $\|\bar{v}_i - \zeta_s\| - \|\bar{v}_i - \zeta_r\| \ge \beta\|\zeta_s - \zeta_r\|$, and, of course, $\|v_i - \zeta_r\|^2 = \|v_i - \bar{v}_i\|^2 + \|\bar{v}_i - \zeta_r\|^2$.
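
The observation about projections can be checked numerically. The sketch below is only an illustration on hand-picked points (the helper names are hypothetical): comparing the distances of a point to two centers gives the same answer as comparing the distances of its projection onto the line through those centers, and the Pythagorean identity used above holds.

```python
import math

def sqdist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def project_onto_line(x, a, b):
    """Orthogonal projection of x onto the line through a and b (a != b)."""
    direction = [bj - aj for aj, bj in zip(a, b)]
    t = sum((xj - aj) * dj for xj, aj, dj in zip(x, a, direction))
    t /= sum(dj * dj for dj in direction)
    return [aj + t * dj for aj, dj in zip(a, direction)]

def closer_to_first(x, a, b):
    """True iff x is at least as close to a as to b."""
    return sqdist(x, a) <= sqdist(x, b)

mu_r = (0.0, 0.0, 0.0)
mu_s = (4.0, 0.0, 0.0)
samples = [(1.0, 5.0, -2.0), (3.0, -7.0, 1.0), (0.5, 100.0, 3.0)]

# For each sample: (decision on the original point, decision on its projection).
decisions = [
    (closer_to_first(x, mu_r, mu_s),
     closer_to_first(project_onto_line(x, mu_r, mu_s), mu_r, mu_s))
    for x in samples
]

# Pythagorean identity: |x - mu_r|^2 = |x - proj|^2 + |proj - mu_r|^2.
x = samples[0]
proj = project_onto_line(x, mu_r, mu_s)
pythagoras_gap = sqdist(x, mu_r) - (sqdist(x, proj) + sqdist(proj, mu_r))
```

The component of a point orthogonal to the line contributes equally to both squared distances, which is why it cancels in the comparison.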

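The proof follows from upper- and lower-bounding the term $\|v_i - \zeta_s\|^2 - \|v_i - \zeta_r\|^2$. We've just
shown a lower bound, as we have that
$$\|v_i - \zeta_s\|^2 - \|v_i - \zeta_r\|^2 = \left(\|\bar{v}_i - \zeta_s\| - \|\bar{v}_i - \zeta_r\|\right)\left(\|\bar{v}_i - \zeta_s\| + \|\bar{v}_i - \zeta_r\|\right) \ge \beta\|\zeta_s - \zeta_r\|^2 .$$

The triangle inequality gives that $\|v_i - \zeta_s\| \le \|v_i - \mu_s\| + \alpha(\Delta_r + \Delta_s)$, and that $\|v_i - \zeta_r\| \ge \|v_i - \mu_r\| - \alpha(\Delta_r + \Delta_s)$, so we have the upper bound of
$$\begin{aligned}
\|v_i - \zeta_s\|^2 - \|v_i - \zeta_r\|^2 &\le \left(\|v_i - \mu_s\| + \alpha(\Delta_r + \Delta_s)\right)^2 - \left(\|v_i - \mu_r\| - \alpha(\Delta_r + \Delta_s)\right)^2 \\
&\le \left(\|v_i - \mu_r\| + \alpha(\Delta_r + \Delta_s)\right)^2 - \left(\|v_i - \mu_r\| - \alpha(\Delta_r + \Delta_s)\right)^2 \\
&\le 4\alpha(\Delta_r + \Delta_s)\|v_i - \mu_r\| .
\end{aligned}$$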

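Comparing the upper and the lower bound, we have that for any $i \in X$ the distance $\|v_i - \mu_r\| \ge \frac{\beta}{4\alpha}(c - \alpha)^2(\Delta_r + \Delta_s)$. As $X \subset T_s$, the Markov inequality concludes the proof:
$$|X|\left(\frac{\beta c^2}{8\alpha}\sqrt{k}\,\|A - C\|\right)^2 \frac{1}{\min\{n_r, n_s\}} \le \sum_{i \in X}\|v_i - \mu_r\|^2 \le \|P_V(A - C)\|_F^2 \le 4\|A - C\|^2 . \qquad \Box$$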

Proof of Lemma 4.3. Every $i \in T_{s \to r}$ must satisfy $\|\hat{A}_i - \hat{\mu}_s\| \ge 2\|\hat{A}_i - \hat{\mu}_r\|$ (Proposition 4.4).
Therefore, we must have that $\|\bar{A}_i - \hat{\mu}_s\| \ge 2\|\bar{A}_i - \hat{\mu}_r\|$, where we denote $\bar{A}_i$ as the projection of
$\hat{A}_i$ onto the line connecting $\hat{\mu}_r$ with $\hat{\mu}_s$ (simply because $\|\hat{A}_i - \hat{\mu}_s\|^2 = \|\hat{A}_i - \bar{A}_i\|^2 + \|\bar{A}_i - \hat{\mu}_s\|^2$).
Therefore, $\|\hat{\mu}_r - \hat{\mu}_s\| \le \frac{3}{2}\|\bar{A}_i - \hat{\mu}_s\|$, so $\|\bar{A}_i - \hat{\mu}_s\| - \|\bar{A}_i - \hat{\mu}_r\| \ge \frac{1}{3}\|\hat{\mu}_r - \hat{\mu}_s\|$.

Thus, every $i \in T_{s \to r}$ satisfies the conditions of Lemma 4.5 with $\zeta_r = \hat{\mu}_r$, $\zeta_s = \hat{\mu}_s$, and $\beta = 1/3$.
We deduce that $|T_{s \to r}| \le \alpha^2\,\frac{2304}{c^4 k}\min\{n_r, n_s\}$, where $\alpha$ is the bound s.t. for every $r$, $\|\mu_r - \hat{\mu}_r\| \le \alpha\frac{\sqrt{k}}{\sqrt{n_r}}\|A - C\|$. Since $\alpha \le \frac{1}{\sqrt{k}}$ we conclude the proof.

The fact that $\alpha$ is small was proven by Achlioptas and McSherry (Theorem 1 of [AM05]).
Denote $u_r$ as the indicator vector of $T_r$. Since $\mathrm{rank}(C) \le k$, we get
$$\|\mu_r - \hat{\mu}_r\| = \frac{1}{n_r}\|(A - \hat{A})^T u_r\| \le \frac{1}{n_r}\|u_r\|\,\|A - \hat{A}\| \le \frac{1}{\sqrt{n_r}}\|A - C\| . \qquad \Box$$

As an interesting corollary, Theorem 4.1 dictates that for every $r$ we have that $\|\mu_r - \ldots$ $O(1/c)\,\|\mu_r - \hat{\mu}_r\|$.



4.1 The Proximity Condition — Part III of the Algorithm 

Part II of our algorithm returns centers $\theta_1, \ldots, \theta_k$ which are $O\!\left(\frac{1}{c\sqrt{n_r}}\right)\|A - C\|$ close to the true
centers. Suppose we use these centers to cluster the points: $\Theta_s = \{i : \forall s',\ \|A_i - \theta_s\| \le \|A_i - \theta_{s'}\|\}$.
It is evident that this clustering correctly classifies the majority of the points. It correctly classifies
any point $i \in T_s$ with $\|A_i - \mu_r\| - \|A_i - \mu_s\| = \Omega(\ldots)\,\|A - C\|$ for every $r \ne s$, and the analysis
of Theorem 3.1 shows that at most an $O(c^{-2})$-fraction of the points do not satisfy this condition. In
order to have a direct comparison with the Kumar-Kannan analysis, we now bound the number of
misclassified points w.r.t. the fraction of points satisfying the Kumar-Kannan proximity condition.

Definition 4.6. Denote $\mathrm{gap}_{r,s} = \left(\frac{1}{\sqrt{n_r}} + \frac{1}{\sqrt{n_s}}\right)\|A - C\|$. Call a point $i \in T_s$ $\gamma$-good if for every
$r \ne s$ we have that the projection of $A_i$ onto the line connecting $\mu_r$ and $\mu_s$, denoted $\bar{A}_i$, satisfies
$\|\bar{A}_i - \mu_r\| - \|\bar{A}_i - \mu_s\| \ge \gamma\,\mathrm{gap}_{r,s}$; otherwise we say the point is $\gamma$-bad.
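
Definition 4.6 translates directly into a membership test. The following is a minimal sketch, assuming the true centers, the cluster sizes and the spectral norm $\|A - C\|$ are handed in by the caller; `is_gamma_good` and its parameter names are hypothetical.

```python
import math

def dist(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def project_onto_line(x, a, b):
    """Orthogonal projection of x onto the line through a and b (a != b)."""
    direction = [bj - aj for aj, bj in zip(a, b)]
    t = sum((xj - aj) * dj for xj, aj, dj in zip(x, a, direction))
    t /= sum(dj * dj for dj in direction)
    return [aj + t * dj for aj, dj in zip(a, direction)]

def is_gamma_good(x, own_center, own_size, other_centers, other_sizes,
                  spectral_norm, gamma):
    """Check the gamma-good condition for a point x of cluster s.

    spectral_norm stands for ||A - C||; it is assumed to be computed elsewhere.
    """
    for mu_r, n_r in zip(other_centers, other_sizes):
        x_bar = project_onto_line(x, mu_r, own_center)
        margin = dist(x_bar, mu_r) - dist(x_bar, own_center)
        gap = (1 / math.sqrt(n_r) + 1 / math.sqrt(own_size)) * spectral_norm
        if margin < gamma * gap:
            return False
    return True

# Toy instance: the projection of x is (1, 0), so the margin is 9 - 1 = 8,
# while gap = (1/2 + 1/2) * 2 = 2.
x = (1.0, 3.0)
good_for_1 = is_gamma_good(x, (0.0, 0.0), 4, [(10.0, 0.0)], [4], 2.0, gamma=1)
good_for_5 = is_gamma_good(x, (0.0, 0.0), 4, [(10.0, 0.0)], [4], 2.0, gamma=5)
```

In this toy instance the same point is 1-good but 5-bad, matching the remark that larger $\gamma$ gives a stricter proximity condition.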



Corollary 4.7. If the number of $\gamma$-bad points is $\epsilon n$, then (a) the clustering $\{\Theta_1, \ldots, \Theta_k\}$ misclassifies no more than $\left(\epsilon + O\!\left(\frac{1}{\gamma^2 c^4}\right)\right) n$ points, and (b) $\epsilon \le O\!\left(\left(c - \frac{\gamma}{\sqrt{k}}\right)^{-2}\right)$, assuming $\gamma < c\sqrt{k}$.

Proof. Clearly, all $\epsilon n$ bad points may be misclassified. In addition, for every $r$ and $s \ne r$,
Lemma 4.5 (setting $\zeta_r = \theta_r$, $\zeta_s = \theta_s$, $\alpha = 1/(c\sqrt{k})$ and $\beta = \Omega(\gamma/(c\sqrt{k}))$) proves that no more
than $O(\gamma^{-2} c^{-4} k^{-1})\, n_s$ good points can be misclassified. Summing $\sum_{s \ne r} n_s \le n$, we conclude (a).

The proof of (b) is similar to the proof of Theorem 3.1. We look at the $k$-means cost $\|A - C\|_F^2$.
We show that all $\gamma$-bad points contribute a large amount to this cost.

Take $A_i$ to be a $\gamma$-bad point from $T_s$. Projecting it down to the line connecting $\mu_r$ and $\mu_s$, we
denote the projection as $\bar{A}_i$. Clearly, $\|\mu_r - \mu_s\| = \|\mu_r - \bar{A}_i\| + \|\bar{A}_i - \mu_s\| \ge c\sqrt{k}\,\mathrm{gap}_{r,s}$, whereas
$\|\mu_r - \bar{A}_i\| - \|\bar{A}_i - \mu_s\| < \gamma\,\mathrm{gap}_{r,s}$. It follows that $\|A_i - \mu_s\| \ge \|\bar{A}_i - \mu_s\| \ge \frac{1}{2}(c\sqrt{k} - \gamma)\,\mathrm{gap}_{r,s} \ge \frac{c\sqrt{k} - \gamma}{2\sqrt{n_s}}\|A - C\|$. Again, the Markov inequality gives that
$$\#\{\text{bad points from } T_s\}\cdot\frac{(c\sqrt{k} - \gamma)^2}{4 n_s}\|A - C\|^2 \le \|A - C\|_F^2 \le 8k\|A - C\|^2 ,$$
so from each cluster, only a fraction of $\frac{32k}{(c\sqrt{k} - \gamma)^2}$ of the points can be bad. □

Observe that Corollary 4.7 allows for multiple scaled versions of the proximity condition, based
on the magnitude of $\gamma$. In particular, setting $\gamma = 1$ we get a proximity condition whose bound is
independent of $k$, and still our clustering misclassifies only a small fraction of the points: at most
an $O(c^{-2})$ fraction of all points might be misclassified because they are 1-bad, and no more than an
$O(c^{-4})$-fraction of 1-good points may be misclassified. In addition, if there are no 1-bad points
we show the following theorem. The proof (omitted) merely follows the Kumar-Kannan proof,
plugging in the better bounds provided by Lemma 4.3.

Theorem 4.8. Assume all data points are 1-good. That is, for every point $A_i$ that belongs to the
target cluster $T_{c(i)}$ and every $s \ne c(i)$, by projecting $A_i$ onto the line connecting $\mu_{c(i)}$ with $\mu_s$ we have
that the projected point $\bar{A}_i$ satisfies $\|\bar{A}_i - \mu_s\| - \|\bar{A}_i - \mu_{c(i)}\| = \Omega\!\left(\frac{1}{\sqrt{n_{c(i)}}} + \frac{1}{\sqrt{n_s}}\right)\|A - C\|$, whereas
$\|\mu_{c(i)} - \mu_s\| = \Omega\!\left(\sqrt{k}\left(\frac{1}{\sqrt{n_{c(i)}}} + \frac{1}{\sqrt{n_s}}\right)\right)\|A - C\|$. Then the Lloyd method, starting with $\theta_1, \ldots, \theta_k$,
converges to the true centers.
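
For readers unfamiliar with the Lloyd method referred to in Theorem 4.8, here is a minimal sketch of the standard iteration (assign each point to its nearest center, then recompute each center as the mean of its points). The initial centers play the role of $\theta_1, \ldots, \theta_k$; the data and the function name `lloyd` are illustrative assumptions, not taken from the paper.

```python
def sqdist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, initial_centers, max_iters=100):
    """Run Lloyd iterations until the centers stop changing (or max_iters)."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest current center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda r: sqdist(p, centers[r]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centers = lloyd(points, initial_centers=[(1, 1), (8, 1)])
```

On this toy input the iteration stabilizes after the first update, since the initial centers already induce the correct partition.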

5 Applications 

Clustering a mixture of Gaussians. For a mixture of $k$ Gaussians, we quote the suitable
results without proof, as the proof is identical to the proof in [KK10]. We are given a mixture of $k$
Gaussians, $F_1, \ldots, F_k$, where the standard deviation of each distribution in any direction is at most
$\sigma_r$, and the weight of each distribution is $w_r$. We denote $\sigma_{\max} = \max_r\{\sigma_r\}$ and $w_{\min} = \min_r\{w_r\}$.

Theorem 5.1. Suppose we are given a set of $n \gg \frac{\ldots}{w_{\min}}$ samples from a mixture of $k$ Gaussians,
such that for every $r \ne s$ it holds that $\|\mu_r - \mu_s\| \ge c\,\sigma_{\max}\sqrt{\frac{k}{w_{\min}}}\,\mathrm{polylog}\!\left(\frac{\ldots}{w_{\min}}\right)$. Then w.h.p. these
points satisfy the proximity condition.

For Gaussians, the best known separation bound is Achlioptas and McSherry's bound [AM05]
of $\Omega\!\left(\sigma_{\max}\left(w_{\min}^{-1/2} + \sqrt{k\log(k \cdot \min\{n, 2^k\})}\right)\right)$. As we assume $k$ is large, this separation condition
is $O\!\left(\sigma_{\max}\left(w_{\min}^{-1/2} + \sqrt{k}\right)\right) = \Theta\!\left(\sigma_{\max}/\sqrt{w_{\min}}\right)$. Therefore, the separation bound of Theorem 5.1 is
$\sqrt{k}$ times worse than the best known bound. However, applying Kumar and Kannan's boosting
technique (Section 7 in [KK10]), which replaces the polynomial dependency on $w_{\min}$ with a logarithmic
one, we get:

Theorem 5.2. Suppose we are given a set of $n \gg \ldots$ samples from a mixture of $k$ Gaussians,
such that for every $r \ne s$ it holds that
$$\|\mu_r - \mu_s\| \ge c\,\sigma_{\max}\sqrt{k}\;\mathrm{polylog}\!\left(\frac{\ldots}{w_{\min}}\right) .$$
Then there exists an algorithm that w.h.p. correctly classifies all points.

Therefore, if for any $r$ and $r'$, both $\sigma_r \approx \sigma_{r'}$ and $w_r \approx w_{r'}$, then both [AM05] and Theorem 5.2
give roughly the same bound. If for any $r$ and $r'$ we have that $\sigma_r \approx \sigma_{r'}$, yet $w_{\min} \ll \frac{1}{k}$, then
Theorem 5.2 provides a better bound. If for any $r$ and $r'$ we have that $w_r \approx w_{r'}$, yet the directional
standard deviations of the distributions vary, then the bound of [AM05], in which the distance
between any two cluster centers depends only on the parameters of these two distributions, is the
better bound. If both the standard deviations and the weights vary significantly between the
different distributions, then the better bound is determined on a case-by-case basis.

McSherry's Planted Partition Model. In the Planted Partition Model [McS01] [AK94] [AKS98]
our instance is a random $n$-vertex graph generated by using an implicit partition of the $n$ points
into $k$ clusters. There exists an unknown $k \times k$ matrix of probabilities $P$, and for every pair of
vertices $u, v$ there exists an edge connecting $u$ and $v$ w.p. $P_{rs}$ (assuming $u$ belongs to cluster $r$ and
$v$ to cluster $s$). The goal here is to recover the partition of the points (and thus recover $P$). Viewing
this graph as an $n \times n$ matrix, each row is taken from a special distribution $F_r$ over $\{0,1\}^n$, where
each coordinate $j$ is an independent Bernoulli r.v. with mean $P_{r\,c(j)}$, denoting $c(j)$ as the cluster
$j$ belongs to. Thus, the mean of this distribution, $\mu_r$, is a vector with its $j$-th coordinate set to $P_{r\,c(j)}$.
Denote $w_{\min} = \min_r\{n_r/n\}$ and $\sigma_{\max} = \max_{r,s}\sqrt{P_{rs}}$. The result of [McS01] is that if for every $r \ne s$
$$\|\mu_r - \mu_s\| \ge c\,\sigma_{\max}\sqrt{k}\,\sqrt{\frac{1}{w_{\min}} + \log(n/\delta)} \qquad (2)$$
then it is possible to retrieve the partition of the vertices w.p. at least $1 - \delta$.
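
To make the model concrete, the sketch below samples an adjacency matrix from a planted partition. It is an illustration under stated assumptions: the graph is taken to be undirected with no self-loops, cluster labels are assigned in contiguous blocks, and the function name `sample_planted_partition` is hypothetical.

```python
import random

def sample_planted_partition(sizes, P, seed=0):
    """Sample a symmetric 0/1 adjacency matrix.

    sizes[r] is the number of vertices in cluster r, and P[r][s] is the
    probability of an edge between a vertex of cluster r and one of cluster s.
    Returns the matrix together with the hidden cluster label of every vertex.
    """
    rng = random.Random(seed)
    labels = [r for r, n_r in enumerate(sizes) for _ in range(n_r)]
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            # rng.random() lies in [0, 1), so probability 1.0 always yields
            # an edge and probability 0.0 never does.
            if rng.random() < P[labels[u]][labels[v]]:
                A[u][v] = A[v][u] = 1
    return A, labels

# Degenerate probabilities make the output deterministic: two disjoint cliques.
A, labels = sample_planted_partition([3, 2], [[1.0, 0.0], [0.0, 1.0]])
```

With non-degenerate probabilities, row $i$ of the matrix has mean $\mu_{c(i)}$ (up to the zero diagonal), which is the vector whose separation is measured in (2).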

Kumar and Kannan were not able to match the distance bounds of McSherry, and required
centers to be a $\sqrt{k}$ factor greater than the bound of (2). Here we match the bound of McSherry
exactly. Following the proof in Kumar-Kannan (with few changes), we prove:

Theorem 5.3. Assuming that $\sigma_{\max} \ge \frac{3\log(n)}{n}$ and that the planted partition model satisfies equation (2) for every $r \ne s$, then w.p. at least $1 - \delta$, every point satisfies the proximity condition.

Proof. We follow the proof of Kumar-Kannan, making the suitable changes. McSherry (Theorem
10 of [McS01]) showed that w.h.p. $\|A - C\| \le 4\sigma_{\max}\sqrt{n}$. So our goal is to show that, w.h.p., all
points are $\sqrt{k}$-good. I.e., denoting $u$ as a unit-length vector connecting $\mu_r$ and $\mu_s$, we show that
w.h.p. for every $i \in T_r$ we have
$$|(A_i - \mu_r) \cdot u| \le O\!\left(\sqrt{k}\,\sigma_{\max}\sqrt{\frac{1}{w_{\min}} + \log(n/\delta)}\right) .$$

Observe $u = \frac{\mu_s - \mu_r}{\|\mu_s - \mu_r\|}$, and due to the special structure of the means in this model, we have that
$(\mu_r - \mu_s)_j = P_{rt} - P_{st}$ where $j \in T_t$. It follows that
$$\|\mu_r - \mu_s\|^2 = \sum_{t=1}^{k} n_t (P_{rt} - P_{st})^2 .$$

We therefore have
$$|(A_i - \mu_r) \cdot u| \le \frac{1}{\|\mu_r - \mu_s\|}\sum_{t=1}^{k} |P_{rt} - P_{st}|\,\Big|\sum_{j \in T_t}(A_{ij} - P_{rt})\Big| .$$

Observe, $A_{ij}$ are i.i.d. 0-1 random variables with mean $P_{rt}$, so we expect their sum to deviate from
its expectation by no more than a few standard deviations. Indeed, Kumar and Kannan prove that
w.h.p. it holds that for every $t$ we have
$$\Big|\sum_{j \in T_t}(A_{ij} - P_{rt})\Big| \le B\,\sigma_{\max}\sqrt{n_t}\,\sqrt{\frac{1}{w_{\min}} + \log(n/\delta)} ,$$
where $B$ is some sufficiently large constant. This allows us to deduce that
$$|(A_i - \mu_r) \cdot u| \le B\,\sigma_{\max}\,\frac{\sum_{t=1}^{k}\sqrt{n_t}\,|P_{rt} - P_{st}|}{\sqrt{\sum_{t=1}^{k} n_t (P_{rt} - P_{st})^2}}\,\sqrt{\frac{1}{w_{\min}} + \log(n/\delta)} \le B\sqrt{k}\,\sigma_{\max}\sqrt{\frac{1}{w_{\min}} + \log(n/\delta)} ,$$
where the last inequality is simply the power-mean inequality. □

6 An Open Problem



Our work presents an algorithm which successfully clusters a dataset, provided that the distance
between any two cluster centers meets a certain lower bound. We would like to point out one
particular direction to improve this bound. Note that our center separation bound depends on
$\|A - C\|$, a property of the entire dataset. It would be nice to handle the case where the separation
condition between $\mu_r$ and $\mu_s$ depends solely on $T_r$ and $T_s$. That is, if we define $\Delta_r = \frac{1}{\sqrt{n_r}}\|A_r - C_r\|$,
is it possible to successfully separate clusters s.t. $\|\mu_r - \mu_s\| \ge c(\Delta_r + \Delta_s)$? We comment that most of
our analysis (and particularly Lemma 4.3) builds only on the ratio between $\|\mu_r - \nu_r\|$ and $\|\mu_r - \mu_s\|$:
we assume the first is no greater than $6\Delta_r$ and that the latter is no less than $c(\Delta_r + \Delta_s)$. In fact,
one can revise the proofs of Theorems 3.1 and 3.2 so that they hold based on this assumption
alone (without using the properties of the SVD). The problem therefore boils down to finding
initial centers $\{\nu_r\}$ that are sufficiently close to the true centers $\{\mu_r\}$, under the assumption that
$\forall r \ne s$, $\|\mu_r - \mu_s\| \ge c(\Delta_r + \Delta_s)$. But this is an intricate task, mainly because such a separation
condition does not imply that $\{\mu_1, \mu_2, \ldots, \mu_k\}$ are the centers minimizing the $k$-means cost! (Nor
do $\{\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k\}$ minimize the $k$-means cost of $\hat{A}$.) Consider the case, for example, where
cluster $r$ has very few points (say $n_r = \sqrt{n}$) and very small variance, and cluster $s$ is very big
(say $n_s = n/5$), and is essentially composed of two sub-components with distance $\frac{2}{\sqrt{n_s}}\|A_s - C_s\|$
between the centers of the two sub-components. The $k$-means cost of placing two centers within
cluster $s$ is smaller than that of placing one center at $\mu_s$ and one center at $\mu_r$. This relates to the question of
designing a $t$-approximation algorithm for $k$-means, guaranteeing that each cluster's cost cannot
increase by more than a factor of $t$.
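
The phenomenon in this example can be seen on a one-dimensional toy instance. The numbers below are assumptions chosen for illustration only (they are not checked against the separation condition): a large cluster made of two sub-components at $-1$ and $+1$, and a tiny, tight cluster at $5$.

```python
def kmeans_cost(points, centers):
    """k-means cost: each point pays its squared distance to the nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

big_cluster = [-1.0] * 50 + [1.0] * 50   # two sub-components, true mean 0
small_cluster = [5.0] * 3                # few points, zero variance
points = big_cluster + small_cluster

# One center per true cluster (the true means 0 and 5).
cost_true_means = kmeans_cost(points, [0.0, 5.0])
# Both centers spent on the big cluster's sub-components.
cost_split_big = kmeans_cost(points, [-1.0, 1.0])
```

Here splitting the big cluster costs 48 while using the true means costs 100, so a cost-minimizing solution ignores the small cluster even though its center is far from both sub-components.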
A Some Basic Lemmas

Fact A.1 (Lemma 9 from [McS01]). $\|\hat{A} - C\|_F^2 \le 8\min\{k\|A - C\|^2, \|A - C\|_F^2\}$ ($= 8 n_r \Delta_r^2$ for every $r$).
Proof.
$$\|\hat{A} - C\|_F^2 \le 2k\|\hat{A} - C\|^2 \le 2k\left(\|\hat{A} - A\| + \|A - C\|\right)^2 \le 2k\left(2\|A - C\|\right)^2$$
where the first inequality holds because $\mathrm{rank}(\hat{A} - C) \le 2k$, and the last inequality follows from the
fact that $\hat{A} = \arg\min_{N : \mathrm{rank}(N) = k}\{\|A - N\|\}$. For the same reason, $\|\hat{A} - C\|_F \le \|A - \hat{A}\|_F + \|A - C\|_F \le 2\|A - C\|_F$. □

Fact A.2 (Claim 1 in Section 3.2 of [KV09]). For every $\mu_r$ there exists a center $\nu_s$ s.t. $\|\mu_r - \nu_s\| \le 6\Delta_r$, so we can match each $\mu_r$ to a unique $\nu_r$.
Proof. Observe that by taking $\hat{A} - \hat{C}$, we project $A - C$ to a $k$-dimensional subspace, so we have
that $\|\hat{A} - \hat{C}\|_F^2 \le k\|\hat{A} - \hat{C}\|^2 \le k\|A - C\|^2$. Similarly, $\|\hat{A} - \hat{C}\|_F^2 \le \|A - C\|_F^2$.

Assume for the sake of contradiction that $\exists r$ s.t. $\|\mu_r - \nu_s\| > 6\Delta_r$ for all $s$. Since $\|\hat{A} - \hat{C}\|_F^2 \le n_r\Delta_r^2$, our 10-approximation algorithm yields a clustering of cost $\le 10 n_r \Delta_r^2$. In contrast, as
each $\hat{A}_i$ is assigned to some $\nu_{c(i)}$, the contribution of only the points in $T_r$ to the $k$-means cost of
the clustering is more than
$$\sum_{i \in T_r}\left\|(\mu_r - \nu_{c(i)}) - (\mu_r - \hat{A}_i)\right\|^2 \ge \frac{n_r}{2}(6\Delta_r)^2 - \sum_{i \in T_r}\|\hat{A}_i - \mu_r\|^2 \ge 18 n_r \Delta_r^2 - \|\hat{A} - C\|_F^2 \ge 10 n_r \Delta_r^2 ,$$
where the first inequality follows from the fact that $(a - b)^2 \ge \frac{1}{2}a^2 - b^2$. □

Now, in order to prove Fact 1.3 (also cited below as Fact A.4), we need the following fact.

Fact A.3 (Lemma 5.2 and Corollary 5.3 from [KK10]). Fix any cluster $T_r$ and a subset $X \subset T_r$.
Then
$$|X|\,\|\mu(X) - \mu_r\| = (|T_r| - |X|)\,\|\mu(T_r \setminus X) - \mu_r\| \le \sqrt{|X|}\,\|A_r - C_r\| .$$
Proof. Let $u_X$ be the indicator vector of $X$. Then
$$\big\|\,|X|\,(\mu(X) - \mu_r)\big\| = \|(A_r - C_r)^T u_X\| \le \|(A_r - C_r)^T\|\,\|u_X\| = \|A_r - C_r\|\sqrt{|X|} ,$$
and the fact that $|X|\,\|\mu(X) - \mu_r\| = |T_r \setminus X|\,\|\mu(T_r \setminus X) - \mu_r\|$ is simply because $\mu_r = \frac{|X|}{|T_r|}\mu(X) + \frac{|T_r \setminus X|}{|T_r|}\mu(T_r \setminus X)$. □
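
The equality in Fact A.3 is easy to sanity-check numerically. The sketch below uses an arbitrary toy cluster (assumed data) and verifies only the identity $|X|\,\|\mu(X) - \mu_r\| = |T_r \setminus X|\,\|\mu(T_r \setminus X) - \mu_r\|$; it does not compute the spectral norm, so the inequality part is not tested.

```python
import math

def mean(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def norm_of_difference(p, q):
    """Euclidean norm of p - q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

T_r = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0), (6.0, 2.0), (2.0, 4.0)]
X = T_r[:2]          # an arbitrary subset of the cluster
rest = T_r[2:]       # its complement within the cluster

mu_r = mean(T_r)
lhs = len(X) * norm_of_difference(mean(X), mu_r)
rhs = len(rest) * norm_of_difference(mean(rest), mu_r)
```

Both sides equal $2\sqrt{5}$ here, reflecting that the deviations of the two sub-means from $\mu_r$, weighted by their sizes, must cancel.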

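Fact A.4. Fix a target cluster $T_r$ and let $S_r$ be a set of points created by removing $p_{out} n_r$ points
from $T_r$ and adding $p_{in}(s) n_r$ points from each cluster $s \ne r$, s.t. every added point $x$ satisfies
$\|x - \mu_s\| \ge \frac{2}{3}\|x - \mu_r\|$. Assume $p_{out} < \frac{1}{2}$ and $p_{in} := \sum_{s \ne r} p_{in}(s) < \frac{1}{2}$. Then
$$\|\mu(S_r) - \mu_r\| \le \frac{1}{\sqrt{n_r}}\left(\sqrt{p_{out}} + \frac{3}{2}\sum_{s \ne r}\sqrt{p_{in}(s)}\right)\|A - C\| \le \left(\frac{\sqrt{p_{out}} + \frac{3}{2}\sqrt{k\,p_{in}}}{\sqrt{n_r}}\right)\|A - C\| .$$

Proof. We break $\|\mu(S_r) - \mu_r\|$ into its components and deduce
$$\begin{aligned}
\|\mu(S_r) - \mu_r\| &\le \frac{(1 - p_{out}) n_r}{n_r}\|\mu(S_r \cap T_r) - \mu_r\| + \sum_{s \ne r}\frac{p_{in}(s) n_r}{n_r}\|\mu(S_r \cap T_s) - \mu_r\| \\
&\le \frac{(1 - p_{out}) n_r}{n_r}\|\mu(S_r \cap T_r) - \mu_r\| + \frac{3}{2}\sum_{s \ne r}\frac{p_{in}(s) n_r}{n_r}\|\mu(S_r \cap T_s) - \mu_s\| .
\end{aligned}$$

Plugging in Fact A.3 we have $\|\mu(S_r) - \mu_r\| \le \frac{1}{n_r}\left(\sqrt{p_{out} n_r} + \frac{3}{2}\sum_{s \ne r}\sqrt{p_{in}(s) n_r}\right)\|A - C\|$. The last
inequality comes from maximizing the sum of square-roots by taking each $p_{in}(s) = p_{in}/k$. □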






